DO NOT MERGE - CI sandbox for stateless scheduler b#25172
Conversation
|
/tag-and-rerun-ci |
There was a problem hiding this comment.
Code Review
This pull request refactors the chunked prefill mechanism by replacing the global chunked_req pointer with per-request state flags (has_pending_chunk and pending_middle_outputs). This change enhances support for pipeline parallelism and ensures more robust state management across iterations. The review feedback identifies potential null pointer crashes, logic inconsistencies in request abort handling, and performance optimizations for hot-path queue scans, all of which include actionable code suggestions.
| for mb_list in (self.mbs, self.last_mbs, self.running_mbs): | ||
| for mb in mb_list: | ||
| if mb is not None and not mb.is_empty(): | ||
| batch_reqs.extend(mb.reqs) |
There was a problem hiding this comment.
The iteration over self.mbs, self.last_mbs, and self.running_mbs will crash if any of these attributes are None. While self.mbs is typically a list, last_mbs and running_mbs are often None in certain scheduler states or configurations.
| for mb_list in (self.mbs, self.last_mbs, self.running_mbs): | |
| for mb in mb_list: | |
| if mb is not None and not mb.is_empty(): | |
| batch_reqs.extend(mb.reqs) | |
| if self.pp_size > 1 and hasattr(self, "mbs"): | |
| for mb_list in (self.mbs, self.last_mbs, self.running_mbs): | |
| if mb_list is not None: | |
| for mb in mb_list: | |
| if mb is not None and not mb.is_empty(): | |
| batch_reqs.extend(mb.reqs) |
| if (recv_req.abort_all or req.rid.startswith(recv_req.rid)) and ( | ||
| req.rid not in batch_rids | ||
| ): |
There was a problem hiding this comment.
Aborted requests that are currently in a batch (e.g., chunked-resume requests) should still be removed from the waiting_queue list to maintain a consistent scheduler state. The current logic skips them entirely. To avoid double-releasing resources, you should remove them from the list but skip the release_kv_cache call inside the processing loop (by checking if req.rid in batch_rids: continue).
if (recv_req.abort_all or req.rid.startswith(recv_req.rid)):| # priority + has_pending_chunk make it sit at the head, but its | ||
| # presence relaxes the "is queue empty / pool full" early exits below | ||
| # (we must keep scheduling it to make progress, or memory leaks). | ||
| has_chunked_resume = any(r.has_pending_chunk for r in self.waiting_queue) |
There was a problem hiding this comment.
Performing an waiting_queue using any() in the scheduler's hot path is inefficient. Since the scheduling policy ensures that has_pending_chunk requests are sorted to the front of the queue, you can optimize this by checking only the first element.
| has_chunked_resume = any(r.has_pending_chunk for r in self.waiting_queue) | |
| has_chunked_resume = self.waiting_queue[0].has_pending_chunk if self.waiting_queue else False |
| chunked_resume = next( | ||
| (r for r in self.waiting_queue if r.has_pending_chunk), None | ||
| ) |
There was a problem hiding this comment.
|
/rerun-failed-ci |
|
/rerun-test stage-b-test-2-gpu-large |
|
⛔ |
|
/rerun-failed-ci |
5 similar comments
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
|
|
CI status snapshot (pre-rebase)Current attempt 5 results: 107 success / 9 failure / 16 skipped / 2 still queued. Per-failure classification (CUDA-lane only; AMD/NPU skipped per project policy):
Plan: rebasing |
When chunked-resume reqs are held in both waiting_queue and batch.reqs (stateless-scheduler refactor), abort_request would otherwise process them twice (queue pop + to_finish), causing duplicate send_output and double release_kv_cache. Build batch_rids upfront and skip waiting_queue removal for reqs already in batch — let to_finish path handle them. Pre-flight for stateless-scheduler v2.
For chunked-resume reqs (after the upcoming stateless-scheduler switch) that live in waiting_queue with non-empty prefix_indices, summing req.seqlen overcounts the committed prefix. Switch to seqlen - prefix for waiting reqs; keep the chunked_req block until that field is removed. Today's behavior is unchanged for fresh waiting reqs whose prefix_indices is empty. Pre-flight for stateless-scheduler v2.
|
/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py |
CUDA
|
|
🚀 |
LoRA Qwen3-8B
|
|
/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py |
|
🚀 |
AMD 2-GPU
|
AMD MI300 lane: two near-threshold perf assertion flakes — not ours, not blockingRun on
Both already retried internally once and failed twice; classic AMD MI300 hardware-noise perf threshold flake territory. Our diff is scheduler-side chunked-resume bookkeeping with no AMD / MoE / serving-perf code paths touched. |
Non-CUDA lane failures on
|
| Lane | Job | Cause |
|---|---|---|
| AMD MI300 | stage-b-test-1-gpu-small-amd (3) |
HW Exception by GPU node-2 ... reason: GPU Hang (hardware) |
| AMD MI300 | stage-b-test-1-gpu-small-amd (6), (9) |
likely same GPU-hang signature |
| AMD MI35x | stage-b-test-1-gpu-small-amd-mi35x, stage-b-test-large-8-gpu-...-disaggregation-amd, stage-c-test-large-8-gpu-amd-mi35x (0), (1) |
MI35x lane failures |
| AMD MI300 | stage-c-test-4-gpu-amd (0), stage-c-test-large-8-gpu-amd (1) |
non-blocking AMD stage-c |
| NPU | stage-b-test-16-npu-a3 |
test_npu_deepep.py failed in NPU-specific code path |
| NPU | pr-test-npu-finish |
meta cascade |
None plausibly caused by this PR's diff (scheduler-side chunked-resume bookkeeping with no AMD, no NPU, no deep-EP, and no kernel-level paths). Per sglang-babysit-ci skill: non-CUDA lanes are only fixed when clearly ours AND easy. Not blocking the CUDA gate.
|
/rerun-test test/registered/8-gpu-models/test_deepseek_v32_indexcache.py |
|
|
🚀 |
|
/rerun-test test/registered/dsv4/test_deepseek_v4_flash_fp4_megamoe_b200.py |
|
|
🚀 |
schedule_batch.py: drop self.maybe_wait_verify_done() call in merge_batch — upstream removed verify_done.wait via FutureMap routing (sgl-project#25879); keep our branch's assert against chunked/dllm reqs in other.reqs. test/registered/unit/managers/test_scheduler_chunked_req_gate.py: keep HEAD's deletion (v1 gate removed in v2); upstream's array.array migration is moot since the file goes away.
Adds a by-rid dict tracking sync-mode reqs scheduler currently owns the lifecycle of (admitted, not finished, not retracted). Runs as a parallel tracker alongside existing waiting_queue / running_batch without changing scheduler behavior. DEBUG_INVARIANTS=1 enables _assert_invariants checks at get_next_batch_to_run boundaries. Part of waiting_queue refactor plan, commit 1/7. See agent-drafts/ 2026-05-25-waiting-queue-refactor-plan.md.
Eliminates H3 hack (Stage A scanning the full waiting_queue to find chunked-resume reqs). Now scans the chunked_reqs() view derived from active_reqs. Behavior identical to C1 because C1's retention keeps waiting_queue and active_reqs in sync for chunked-resume reqs. Part of waiting_queue refactor plan, commit 2/7.
…(C3) Adds an inline chunked admission block at the top of _get_new_batch_prefill_raw that consumes chunked_reqs() directly. Strips has_pending_chunk branches from the main waiting_queue loop (H6 LoRA drainer bypass, H7 init_next_round_input split). The waiting_queue retention for chunked-resume is still in place; it is removed in C4. Single-flight assertion enforced at the inline admission entry. Part of waiting_queue refactor plan, commit 3/7.
Chunked-resume reqs no longer anchor in waiting_queue (H2 hack elimination). The retention `or x.has_pending_chunk` is removed; the transitional guard added in C3 to prevent double-admit is also removed. After this commit, chunked-resume reqs live exclusively in active_reqs and are re-admitted via the inline block at the top of _get_new_batch_prefill_raw. Part of waiting_queue refactor plan, commit 4/7.
…d bypasses (C5) Now that chunked-resume reqs live in active_reqs (post-C4), the defensive bypasses that scanned waiting_queue for has_pending_chunk become dead code. Eliminates H4 (early-exit has_chunked_resume scan), H5 (dynamic-chunking lookup), AB7 (_abort_on_waiting_timeout has_pending_chunk skip), plus a stale comment referencing the deleted retention. Single chunked_in_active computation reused throughout _get_new_batch_prefill_raw. Part of waiting_queue refactor plan, commit 5/7.
Eliminates H1 (dual-existence comment) and H8 (defensive has_pending_chunk / pending_middle_outputs reset on waiting-segment orphan release). Post-C4 chunked-resume reqs no longer live in waiting_queue, so the waiting-segment orphan branch is narrowed to mamba-pool reqs only. Critical: the active-segment loop now iterates active_reqs instead of batch_reqs, distinguishing in-batch reqs (FINISH_ABORT via batch result path) from stashed chunked-resume reqs (immediate release + _deactivate, audit finding 2). Without this, aborting a chunked- resume mid-prefill outside of any current batch would leak row + KV + lock_ref. Part of waiting_queue refactor plan, commit 6/7.
- Tightens _assert_invariants: waiting_queue and active_reqs are now strictly disjoint (sync mode); C1's relaxed transitional clause removed. - Removes C1's _activate idempotency filter at the admission call site; the main admission loop no longer produces re-admits after C3/C4. - Adds comprehensive invariant documentation as field-level comments on Scheduler.active_reqs and method docstring on chunked_reqs(). - Migrates pause_generation(retract) chunked release path to iterate chunked_reqs() instead of scanning waiting_queue (dead post-C4), and flags a pre-existing latent bug (req not re-enqueued after reset_for_retract). Concludes the waiting_queue refactor chain (commit 7/7). See agent-drafts/2026-05-25-waiting-queue-refactor-plan.md and audit.
Review of C1-C7 revealed two P0 bugs and one P1: 1. _activate fired unconditionally in _get_new_batch_prefill_raw, enrolling disagg PREFILL and DLLM reqs into active_reqs. Neither path has a corresponding _deactivate (disagg PREFILL uses process_batch_result_disagg_prefill; DLLM uses dllm/mixin paths), leaking active_reqs entries indefinitely and crashing abort_all via the new stashed-chunked assert (C6). 2. flush_cache cleared tree cache / pool but not active_reqs, leaving stale dict entries pointing at freed req_pool_idx. Fix: gate _activate at the helper itself (single point of control) to enforce the "sync-mode non-DLLM only" invariant that the plan + audit always assumed but code didn't enforce. flush_cache.clear() ensures the dict is reset alongside other ownership pools. Also: rewrite two stale comments referencing pre-C4 waiting_queue retention. Part of waiting_queue refactor chain, commit 8/7 (post-review fix).
…(C9) C3 inlined the body of upstream's `PrefillAdder.add_chunked_req` into `_get_new_batch_prefill_raw` to avoid resurrecting the special method. But `add_one_req` already supports chunked-resume via its `is_resume` path (`has_pending_chunk and not is_dllm`), which gates: - budget_prefix=0 (no prefix double-count) - skip _req_inc_lock_ref (already held from prior admission) - update has_pending_chunk = truncated So the inline manual budget code was a copy of logic that `add_one_req` already encapsulates. C9 replaces the ~30-line inline block with a single `adder.add_one_req(chunked_req, ...)` call; chunked admission still runs BEFORE the main waiting_queue loop so it skips LoRA drainer / hicache prefetch checks that don't apply to in-flight chunked. Removes scheduler.py access to PrefillAdder protected methods (`_get_dllm_remain_tokens`, `_update_prefill_budget`) — these stay encapsulated. Behavior change: `prefill_delayer_single_pass` / `prefill_max_requests` / `dsa_prefill_cp_in_seq_split` early-exit gates now apply to chunked too. Safe in practice: chunked runs first so can_run_list is empty for `_max_requests` / `cp_in_seq_split` checks; prefill_delayer blocking chunked just delays one iter. Part of waiting_queue refactor chain, commit 9/7.
…LL leak (C10) Two motivations: 1. BUG: C8's `_activate` gate excluded ALL disagg modes, but disagg PREFILL shares _get_new_batch_prefill_raw with sync — chunked-resume reqs were admitted, then orphaned (out of waiting_queue per C4, not in active_reqs per C8), leaking row + KV + lock_ref. Fix: gate to DECODE only (which has its own prealloc/transfer queue ownership), then wire _deactivate at disagg/prefill.py's three release_kv_cache sites and migrate its Stage A loop to chunked_reqs(). 2. CLEANUP: post-C4 chunked-resume never lives in waiting_queue, but several supporting files still split waiting_queue by has_pending_chunk (schedule_policy.py 3 sites, pool_stats_observer.py, invariant_checker.py, several stale comments). Revert/migrate to read active_reqs. DECODE mode is still excluded from active_reqs (Q1=(c)); only PREFILL is now correctly tracked. Part of waiting_queue refactor chain, commit 10/7.
C10 narrowed _activate's gate to DECODE-only, so disagg PREFILL chunked-resume reqs now enter active_reqs and can be reached by the abort_request active段 stashed-chunked branch (C6). But that branch only does release_kv_cache + _deactivate — missing two pieces of disagg PREFILL cleanup that pause_generation(retract) does correctly: 1. disagg_kv_sender.abort() — without this, the peer decode node waits forever for the remaining chunks (hang). 2. release_req_to_metadata_buffer() — metadata buffer slot leak. Mirrors pause_generation(retract) PREFILL handling and abort_request waiting段 PREFILL handling. Also: clean stale "assert above" comment in disagg/decode.py (the assert was deleted in C10). Part of waiting_queue refactor chain, commit 11/7.
Motivation
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci,/tag-run-ci-label,/rerun-failed-ciCI States
Latest PR Test (Base): Not run yet⚠️ Not run on latest push -- push again to dispatch.
Latest PR Test (Extra):